Data Reanalysis

Marcel Ferreira

4/25/23

About me

  • Medical Physicist (2011-2014);

  • Master of Science in Biotechnology (2017);

  • PhD in Biotechnology (2023);

About me

I started in 2013 during my undergraduate degree with protein structure.

Experimental SAXS curve of SCLam and fitting procedure with simulated curves.

Experimental SAXS curve of SCLam and fitting procedure with simulated curves.

About me

  • Since 2015 I work focused on bone tissue, especially with cell-biomaterial interaction;

  • Masters: Kinome;

  • PhD: Transcriptome/Proteome/Epigenome;

Data

Data to wisdom

Data

Data to conspiracy

Bioinformatics

Data science vs Bioinformatics.

NIH: “Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences.”

Biological data

Biological data

Biological data - Databases

Biological data - Databases

Consortiums

Biological data - Databases

  • These are biological science (Online) libraries, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis;

  • Structured annotations!

  • Provide access to data programmatically (API);

  • Data is made freely available under certain licenses;

Biological data - Experiments

Gene Expression Omnibus

ArrayExpress

Microarrays, RNA-Seq, miRNA-Seq, Proteomics (few), CHIP-seq, ATAC-seq, Kinase Array, Single Cell, …

Biological data

Why you MUST publish your data from high-throughput experiments?

  • It allows reuse by the scientific community;

    • Reduction of waste: Money, reagents, LIVES;
  • Peer-review;

    • Open-data/Open science;

    • Good practices;

RNAseq

RNAseq

Advancing RNA-Seq analysis, Nature Biotechnology 28:421-423

GEO-GEO2R

  • Gene Expression Omnibus;

  • NCBI-NIH;

  • About 4350 datasets;

  • Multiple species;

  • Array- and sequence-based data are accepted;

Gene Expression Omnibus

GEO-GEO2R

  • GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions.

  • Microarray - LIMMA;

  • RNAseq - DeSeq2(BETA);

Gene Expression Omnibus

GEO-GEO2R

Query: “osteogenesis”/ “mus musculus”/ “Expression profilling by array”;

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

GEO-GEO2R

  • Multi Omics;
  • GEO2R not available for all data;

ArrayExpress

  • ArrayExpress;

  • EBI-EMBL;

  • 76.801 datasets;

  • Multiple species;

  • Array- and sequence-based data are accepted;

  • No tool for analysis;

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

ArrayExpress

  • Data matrix (Or Expression matrix):

    • Values of counts read by the equipment;

    • Processed OR not;

  • Investigation Design Format (IDF):

    • Set of protocols for this experiment;
  • Sample and Data Relationship Format (SDRF):

    • Metadata about the samples;
  • Array Designs Format (ADF);

    • Information about the probes;

ArrayExpress

R and Bioconductor

  • R programming language;

  • Bioconductor is a repository of bioinformatic packages;

  • For this type of analysis:

    • {Biobase};

    • {Limma}, {edgeR} or {DESeq};

    • {AnnotationDbi};

    • {org.Hs.eg.db} -varies for each analyzed organism;

    • {tidyverse}*;

ArrayExpress

  • Import the data ({readr});

  • ExpMat: ExpressionMatrix or assayData;

    • Numeric matrix;
  • adf : Array description file or featureData:

    • Data.frame;
  • sdf : Sample description file or phenoData;

    • Data.frame.

ArrayExpress

  • Data cleaning;

  • Golden rule:

    • number of ExpMat rows equals the number of adf rows.

    • row names and row order of ExpMat equals of adf rows.

    • number of ExpMat columns equals the number of sdf rows.

    • column names and order of ExpMat equals of sdf rows.

ArrayExpress

  • ExpressionSet: R object from the {Biobase};

  • AnnotatedDataFrame function on adf and sdf;

    adf_final <- Biobase::AnnotatedDataFrame(adf)
    sdrf_final <- Biobase::AnnotatedDataFrame(sdf)
    
    eset <- ExpressionSet(assayData = ExpMat,
                          phenoData = sdf_final,
                          featureData = adf_final)
    •   exprs(eset) # extract the expression matrix
        pData(eset) # extract the pheno data
        fData(eset) # extract the feature data

ArrayExpress

LIMMA

  • Quality control: plotDensities(), PCA;

  • Background correction: backgroundCorrect() ;

  • Normalization: normalizeBetweenArrays();

    exprs(eset) <- normalizeBetweenArrays(exprs(eset),
                                          method = "vsn")

ArrayExpress

LIMMA - differential expression

  • Create a design: model.matrix();

  • Fit the data: lmFit();

  • Make the contrast: makeContrasts();

  • Fit the contrast: contrasts.fit();

  • Compute the statistic: eBayes();

    design <- model.matrix(~ 0 + pData(eset)$factor)
    
    fit <- lmFit(exprs(eset),
                design)
    
    cm <- makeContrasts(c1 = "factor1 - factor2",
                        levels = design)
    
    fit2 <- contrasts.fit(fit, cm)
    
    fit2 <- eBayes(fit2)

ArrayExpress

LIMMA - differential expression

  • Extract the differential expressed genes table: topTable();

    topTable(fit2,
             coef = 1,
             adjust.method = "BH",#Benjamin hocheback - FDR
             number = Inf #show all genes
             )

ArrayExpress

TCGA

  • SNPs;

  • RNAseq;

  • miRNAseq;

  • Methylation;

  • Survival Analysis;

TCGA

  • {TCGABiolinks};

  • R Package;

  • Facilitates data access and analysis.

TCGA

miRNA example - Pancreatic Ductal Adenocarcinoma Study

CancerProject <- "TCGA-PAAD"
DataDirectory <- paste0("/GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")

query.miR <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "miRNA Expression Quantification",
                      #file.type = "hg19.mirna",
                      legacy = FALSE)

TCGA

samplesDown.miR <- getResults(query.miR,cols=c("cases"))

dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                      typesample = "TP")

dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                      typesample = "NT")


queryDown.miR <- GDCquery(project = CancerProject, 
                          data.category = "Transcriptome Profiling",
                          data.type = "miRNA Expression Quantification",
                          #file.type = "hg19.mirna",
                          legacy = FALSE,
                          barcode = c(dataSmTP.miR, dataSmNT.miR))

TCGA

GDCdownload(query = queryDown.miR,
            directory = DataDirectory)

dataAssy.miR <- GDCprepare(query = queryDown.miR, 
                           save = TRUE, 
                           save.filename = FileNameData, 
                           summarizedExperiment = TRUE,
                           directory =DataDirectory )

rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID

TCGA

# using read_count's data 
read_countData <-  colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))

dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT.miR],
                            mat2 = dataFilt[,dataSmTP.miR],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT") 

TCGA

Next steps (suggestions)

  • Enrichment analysis: {clusterProfiler}

    • Gene Onology;

    • KEGG pathways;

  • Gene set enrichment: {fGSEA}, {GSVA}

  • Network interactions

Thanks you!

Marcel Ferreira, PhD (@marceelrf)

https://quartodomarcel.netlify.app/

https://marcel-ferreira.shinyapps.io/SciDashboard_marceelrf/

References

  • https://learn.gencore.bio.nyu.edu/rna-seq-analysis/

  • https://star-protocols.cell.com/protocols/931

  • https://statquest.org/

  • https://www.youtube.com/watch?v=tlf6wYJrwKY

  • https://home.proffernandamaciel.com.br/

  • https://sydney-informatics-hub.github.io/training-RNAseq-slides/01_IntroductionToRNASeq/01_IntroductionToRNASeq.html#1

  • https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf